Strong Scaling Study

Adam Lyon, Muon g-2 IRMA Analysis, Fermilab, October 2020

This notebook examines strong scaling properties of Julia IRMA jobs that make just one plot. I ran jobs with 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 15, and 20 nodes, always with 32 tasks per node. If I examine the timings of the part of the Julia run that includes opening and reading the HDF5 input file, creating the histogram, and running MPI.Reduce and MPI.Gather, I see the expected strong scaling: the jobs get faster with more nodes, since each task has less of the file to read. If I instead examine the total elapsed time reported by the batch system, the scaling is much less clear; it almost appears that more nodes make Julia run more slowly. One speculation is that there is contention when loading packages. A next step could be to try PackageCompiler.jl to make a Julia "app" with a fast startup time.

This notebook answers issue #18 (Analyze results from Strong Scaling jobs); the code may be found in PR #20. This file is IRMA/analyses/018_StrongScaling/StrongScaling.jl.

What is this notebook?

This is a Pluto.jl notebook, and the code here is written in Julia. It is like a Jupyter notebook, but with important differences. The most important difference is that the results appear above the code. Another important difference is that Pluto.jl notebooks are reactive, which means that, unlike Jupyter notebooks, they are always in a consistent state: the notebook tracks cell-to-cell dependencies, and when a cell changes, its dependent cells update at the same time. So even though you are looking at a static HTML representation of the notebook, you can be assured that it is consistent and up-to-date. You'll see that some results have a little triangle next to them; clicking on it opens an expanded view of the result.

The organization of this notebook is that the main results are replicated at the top, with discussion, in the Results section. Each plot is stored in a variable whose name appears below the plot. You can look in the Code section, which contains all the code for this notebook, to see how each plot was made.

Introduction

This notebook examines strong scaling properties of my Julia IRMA jobs that make one plot.

On 10/22, I ran IRMA/jobs/003_StrongScaling/strongScalingJob.jl from commit 82b715b, answering issue #3. This job reads in the cluster energy data from Muon g-2 era 2D and makes a plot of that energy for all clusters. The data is split evenly among all the MPI ranks. I tried 2 nodes through 10 nodes, always with 32 tasks per node (advice from Marc Paterno). I ran the jobs in the debug queue.
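
The even split of the data across ranks can be sketched as follows. `splitrange` here is a hypothetical stand-in for the partitioning logic (the real `partitionDS` works on an HDF5 dataset); with 64 ranks, it gives each rank the same number of rows, or one more.

```julia
# Sketch of splitting nrows evenly across nranks MPI ranks (rank is 0-based).
# A stand-in for IRMA's partitionDS, which operates on an HDF5 dataset.
function splitrange(nrows::Integer, nranks::Integer, rank::Integer)
    base, rem = divrem(nrows, nranks)
    nlocal = base + (rank < rem ? 1 : 0)      # first `rem` ranks get one extra row
    lo = rank * base + min(rank, rem) + 1     # 1-based start index
    lo:(lo + nlocal - 1)
end
```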

On 10/25, I ran three more jobs with 12, 15, and 20 nodes respectively (still 32 tasks per node). These ran in the regular queue, because the debug queue was very full. Note that I ran the 12-node job twice: the first time, I got a strange error, which I think was due to the CSCRATCH filesystem crashing (it's been a bad month for Cori). The second time, it ran fine.

All jobs ran on Haswell. Data came from CSCRATCH.

I recorded MPI timings from the Julia run. I also dumped SLURM accounting information for analysis.

Results

There are several types of results.

Histogram comparison

Each rank makes a histogram of cluster energy. All of these histograms are sent to the root rank with MPI.Gather and saved in the output file. Furthermore, I "reduce" the histograms by merging them into one with MPI.Reduce. See Histogram Comparison in the Code section, where I compare these histograms to be sure that the reduced one is correct. Note that I'm using a static histogram defined in the IRMA.jl package that MPI can manipulate directly, without the need for serialization/deserialization. The tests worked fine.
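
The idea behind the static histogram can be sketched like this: a fixed-bin histogram is just a counts vector, which MPI can sum element-wise with MPI.Reduce and collect with MPI.Gather, no serialization needed. The `fillhist` function below is a hypothetical illustration (not IRMA's actual type); the bin edges mirror the notebook's 0:25:5000 energy axis (200 bins), and the MPI calls are shown as comments since their MPI.jl signatures vary by version.

```julia
# A fixed-bin histogram as a plain counts vector, so MPI can manipulate it
# directly. EDGES mirrors the 0:25:5000 energy axis used in the notebook.
const EDGES = 0.0f0:25.0f0:5000.0f0

function fillhist(energies, edges = EDGES)
    counts = zeros(Int64, length(edges) - 1)
    for e in energies
        i = searchsortedlast(edges, e)           # index of last edge <= e
        1 <= i <= length(counts) && (counts[i] += 1)
    end
    counts
end

# On each rank, roughly (MPI.jl signatures have changed across versions):
#   merged = MPI.Reduce(counts, +, comm; root = 0)   # one element-wise-summed histogram
#   all    = MPI.Gather(counts, comm; root = 0)      # every rank's counts, on the root
```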

MPI Timing information

I record the time in the job with an IRMA Stopwatch. The stopwatch uses MPI.Wtime under the hood. The times are recorded as follows.

Label               Meaning
------------------  ----------------------------------------------------------------------
start               After packages are loaded, functions are defined, and the MPI.Init call
openedFile          After the h5open statement
openedDataSet       After the energy dataset is opened (but no data read yet)
determineRanges     After the ranges to examine are determined with partitionDS
readDataSet         After the dataset is read (this reads the actual data for the rank)
makeHistogram       After the data is histogrammed
gatheredHistograms  After all the histograms have been gathered to the root rank
reducedHistograms   After the histograms have been reduced to the root rank
gatherRankLogs      After the rank's log info has been gathered to the root rank

Timing of anything before start is not recorded.
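
The Stopwatch idea can be sketched as follows. This is a hypothetical, stand-alone version: IRMA's real Stopwatch uses MPI.Wtime, for which `time()` is substituted here so the sketch runs without MPI.

```julia
# Minimal checkpoint stopwatch: record a label and a timestamp at each mark,
# then report elapsed seconds between consecutive checkpoints.
struct Checkpoint
    label::Symbol
    t::Float64
end

mutable struct Stopwatch
    checkpoints::Vector{Checkpoint}
end
Stopwatch() = Stopwatch([Checkpoint(:start, time())])

mark!(sw::Stopwatch, label::Symbol) = push!(sw.checkpoints, Checkpoint(label, time()))

# Elapsed time between consecutive checkpoints, labeled by the later one
intervals(sw::Stopwatch) =
    [c2.label => c2.t - c1.t for (c1, c2) in zip(sw.checkpoints, sw.checkpoints[2:end])]
```

Usage mirrors the table above: `mark!(sw, :openedFile)` after `h5open`, `mark!(sw, :readDataSet)` after the read, and so on.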

Let's look at the timing plots. Note that you can see all the timing plots in the Code section. Some representative plots will be reproduced here.

Here is the run using two nodes (and 32 ranks per node).


First, you see that each rank read 263,062,959 rows, or one less. Just opening the file took over 5 seconds, with the root rank taking a little longer. Reading the dataset itself takes the majority of the time. The spread of times over ranks is rather interesting: some ranks took a bit longer to make the histograms, and some ranks were significantly slower in the MPI.Gather, though the structure perhaps shows some clever consolidating within MPI. Reducing the histograms involves every other rank taking about a second longer than the rest. Gathering the rank logs is very fast. You see all of this structure in the total MPI time.

Let's look at ten nodes.


There's quite a bit of structure in these plots. Note that for readDataSet, where the data is actually read in, some nodes appear to be faster than others. Given that the reading is fast, but its time still dominates, the structure of the reducedHistograms becomes clear in the total time.

Here are twenty nodes.


Here is a box plot of the total time vs. number of nodes...


Since the job is only as fast as its slowest rank, let's determine the maximum total time. We can then also determine the predicted cost of the job.

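
The cost prediction amounts to nodes × node-hours × machine charge factor. A hedged sketch: the factor of 140 below is inferred from matching the cost table in the Code section (2 nodes × 49.09 s gives 3.818), not read from the notebook's code; it is consistent with Cori Haswell's NERSC charge factor.

```julia
# Predicted NERSC-hours charged for a run: nodes * node-hours * charge factor.
# The factor 140 is inferred from the notebook's cost table, not from its code.
jobcost(nnodes, maxtime_s; chargefactor = 140) =
    nnodes * (maxtime_s / 3600) * chargefactor

jobcost(2, 49.0918)    # ≈ 3.82, matching the 2-node entry in the cost table
jobcost(20, 13.713)    # ≈ 10.67
```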

Information from the batch system

The above, however, is not the whole story, it seems. I can also get timing information from the batch system with the sacct command. I've done this for the jobs run here. This plot,


compares the MPI total time (green points) to the total time the batch system reported for the Julia srun step (orange points). The elapsed time for the entire batch job is shown in the blue points. For 6 nodes, the Julia time is longer than the total batch time - that seems nonsensical.

There is a very large discrepancy between the MPI total time and the total Julia and batch time. Furthermore, the total Julia/Batch time for 12, 15 and 20 nodes is markedly higher. That may be due to running in the regular queue, though I can't think of a good reason for this.

The significant difference between the batch times and the MPI time must be Julia startup, package loading and initializing MPI. I don't have specific timings for those steps. Further investigation is required.

Let's look at the costs for these jobs as computed from the sacct data. They match the cost data in IRIS.


Let's look at other information from the accounting data.

Here is the maximum RSS reported by a task in MB.


Here is the maximum VM size. Note that the scale is GB. This looks almost reasonable, except for the big dip at 15 nodes.


We can look at the maximum bytes read (note scale is MB)...


and maximum bytes written (note scale is KB)...


Conclusions

It is clear that there is a significant fraction of the Julia run that I am not including in my MPI timing. When I look at the scaling of timing reported by the batch system, the scaling is nonsensical, especially for the 12, 15 and 20 node runs. Could that be due to running them in the regular queue instead of the debug queue? How much time does it take to start Julia, load packages and initialize MPI? Could there be contention for disk when Julia is starting and packages are loaded?

Next steps

Try to add timing information for loading packages and initializing MPI. Perhaps try PackageCompiler.jl to speed up loading. Run everything in the debug queue.
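
A sketch of both next steps, under the assumption that the load time is what dominates. `Statistics` stands in for the real package list (HDF5, MPI, etc.), and the PackageCompiler invocation is shown as a comment since it is slow and environment-specific.

```julia
# Time the package loads at the top of the job script.
# @eval forces the `using` to run at call time, so @elapsed can measure it.
load_time = @elapsed @eval using Statistics   # stand-in for HDF5, MPI, etc.
println("package load took $(round(load_time; digits = 3)) s")

# PackageCompiler.jl could then bake the packages into a system image for
# fast startup -- roughly (not executed here; see the PackageCompiler docs):
#   using PackageCompiler
#   create_sysimage(["HDF5", "MPI"]; sysimage_path = "irma_sys.so")
# and the job would launch with `julia --sysimage irma_sys.so`.
```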


Code

"/Users/lyon/Development/gm2/data/003_StrongScaling/"
histoFiles

One file plots

Get our footing by looking at one results file.

d
Base.KeySet for a Dict{String,Any} with 4 entries. Keys:
  "allTimings"
  "allHistos"
  "oneHisto"
  "allRankLogs"
Histogram Comparison

Let's look at the histogram information. We have a set of histograms and we have the reduced histogram. They should match.

allHistos
288
allHistsO
allHistsOS
Hist: n=16836029329 | value=(x = 0.0f0:25.0f0:5000.0f0, y = [113229, 2573686, 19013147, 76776828, 191017207, 280871996, 323201031, 328332785, 312998903, 293851706  …  4609, 4537, 4301, 4147, 3860, 3530, 3380, 3123, 2974, 2709])
oneHist
SHist: n=16836029329 | value=(x = 0.0f0:25.0f0:5000.0f0, y = [113229, 2573686, 19013147, 76776828, 191017207, 280871996, 323201031, 328332785, 312998903, 293851706  …  4609, 4537, 4301, 4147, 3860, 3530, 3380, 3123, 2974, 2709])
Test Summary:                    | Pass  Total
MPI Reduced Histogram is correct |    3      3
200

Let's look at timing information

rt
timingPlotsForRun (generic function with 1 method)
h2x32
rtSum
rankLogs

DataFrame all the things

Let's look at all of the output files and put the data into a DataFrame

extractNNodesFromFileName (generic function with 1 method)
dataFrameFromRankData (generic function with 1 method)

Try processing one file...


Now process all of the files...

numNodes | rank | numRows | openedFile | openedDataSet | determineRanges | readDataSet | madeHistogram | gatheredHistograms | reducedHistograms | gatheredRankLogs | totalTime
2 | 0 | 263062959 | 5.65981 | 0.00554 | 0.02147 | 34.76854 | 4.49920 | 2.81411 | 1.24216 | 0.08100 | 49.09183
2 | 1 | 263062959 | 5.40361 | 0.01453 | 0.01241 | 34.34525 | 4.44231 | 0.52533 | 0.15195 | 0.08957 | 44.98496
2 | 2 | 263062959 | 5.38364 | 0.03648 | 0.01134 | 34.36919 | 4.42630 | 0.49799 | 1.26147 | 0.08274 | 46.06916
2 | 3 | 263062959 | 5.38748 | 0.03250 | 0.01168 | 34.25730 | 4.43874 | 0.50105 | 0.13320 | 0.08342 | 44.84538
2 | 4 | 263062959 | 5.39458 | 0.02545 | 0.01138 | 35.31574 | 4.41535 | 0.84529 | 1.25350 | 0.08552 | 47.34681
2 | 5 | 263062959 | 5.39546 | 0.02469 | 0.01152 | 35.33271 | 4.59060 | 0.49572 | 0.13244 | 0.08125 | 46.06439
2 | 6 | 263062959 | 5.38416 | 0.03630 | 0.01126 | 35.39437 | 4.53565 | 0.64621 | 1.23811 | 0.08411 | 47.33016
2 | 7 | 263062959 | 5.39572 | 0.02464 | 0.01158 | 35.65788 | 4.43148 | 0.48669 | 0.13098 | 0.08181 | 46.22079
2 | 8 | 263062959 | 5.38005 | 0.04062 | 0.01132 | 34.66771 | 4.50569 | 2.26494 | 1.24909 | 0.08036 | 48.19978
2 | 9 | 263062959 | 5.38067 | 0.04001 | 0.01116 | 35.44915 | 4.60772 | 0.48082 | 0.12990 | 0.07915 | 46.17858

(Times in seconds; only the first ten rows, ranks 0–9 of the 2-node run, are displayed.)
theNumNodes
plotsForRun (generic function with 1 method)

With the slider below, you can choose which run to view. Note that this is a little glitchy: if not all of the plots appear, adjust the slider and go back.

Timing plots of ranks

Plots for run with 2 nodes (32 ranks per node)


Plot the scaling...

strongScalingPlot

Determine the maximum total time for each run, because the job is only as fast as its slowest rank.
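
The notebook does this with DataFrames (presumably something like `combine(groupby(df, :numNodes), :totalTime => maximum)`, which would produce the `totalTime_maximum` column seen below). The underlying reduction, in plain Julia:

```julia
# Maximum value per group key -- e.g. max totalTime per numNodes.
# A plain-Julia stand-in for DataFrames' groupby/combine.
function maxbygroup(keys, values)
    out = Dict{eltype(keys),eltype(values)}()
    for (k, v) in zip(keys, values)
        out[k] = max(get(out, k, v), v)
    end
    sort(collect(out); by = first)   # sorted (key => max) pairs
end
```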

totalMPITimes

12 rows × 2 columns

numNodes | totalTime_maximum
Int64 | Float64
2 | 49.0918
3 | 36.3385
4 | 28.9671
5 | 27.3876
6 | 25.4913
7 | 23.771
8 | 20.118
9 | 19.6573
10 | 18.4349
12 | 16.7009
15 | 14.628
20 | 13.713
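
From these maxima one can also compute the strong-scaling efficiency relative to the 2-node baseline (this calculation is my addition, using the numbers in the table above):

```julia
# Strong-scaling efficiency relative to the 2-node run:
# eff = (t0 * n0) / (t_n * n), with t0 the 2-node max MPI total time.
efficiency(nnodes, t_n; n0 = 2, t0 = 49.0918) = (t0 * n0) / (t_n * nnodes)

efficiency(10, 18.4349)   # ≈ 0.53
efficiency(20, 13.713)    # ≈ 0.36
```
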
maxMPITimesPlot
strongScalingCostPlot

Do this for the maximum times


12 rows × 3 columns

numNodes | totalTime_maximum | cost
Int64 | Float64 | Float64
2 | 49.0918 | 3.81825
3 | 36.3385 | 4.23949
4 | 28.9671 | 4.50599
5 | 27.3876 | 5.32537
6 | 25.4913 | 5.94797
7 | 23.771 | 6.471
8 | 20.118 | 6.25893
9 | 19.6573 | 6.88006
10 | 18.4349 | 7.16913
12 | 16.7009 | 7.79376
15 | 14.628 | 8.533
20 | 13.713 | 10.6657
costMPIPlot

Examine accounting information

Examining the raw timing information from within MPI is not the whole story. Let's look at the Cori accounting information, which I can get by running ~/bin/sacct_csv in my NERSC directory. I've copied the output here.

sacct

113 rows × 25 columns (omitted printing of 18 columns)

JobID | JobName | QOS | State | ExitCode | NNodes | NCPUS
String | String | String? | String | String | Int64 | Int64
35415096 | run_strongScalingJob.sh | debug_hsw | FAILED | 15:0 | 1 | 64
35415096.batch | batch | missing | FAILED | 15:0 | 1 | 64
35415096.extern | extern | missing | COMPLETED | 0:0 | 1 | 64
35415096.0 | julia | missing | FAILED | 15:0 | 1 | 32
35415110 | run_strongScalingJob.sh | debug_hsw | FAILED | 1:0 | 1 | 64
35415110.batch | batch | missing | FAILED | 1:0 | 1 | 64
35415110.extern | extern | missing | COMPLETED | 0:0 | 1 | 64
35415110.0 | julia | missing | FAILED | 1:0 | 1 | 32
35415167 | run_strongScalingJob.sh | debug_hsw | FAILED | 1:0 | 1 | 64
35415167.batch | batch | missing | FAILED | 1:0 | 1 | 64
35415167.extern | extern | missing | COMPLETED | 0:0 | 1 | 64
35415167.0 | julia | missing | FAILED | 1:0 | 1 | 32
35415240 | run_strongScalingJob.sh | debug_hsw | OUT_OF_MEMORY | 0:125 | 1 | 64
35415240.batch | batch | missing | OUT_OF_MEMORY | 0:125 | 1 | 64
35415240.extern | extern | missing | COMPLETED | 0:0 | 1 | 64
35415240.0 | julia | missing | OUT_OF_MEMORY | 0:125 | 1 | 32
35415285 | run_strongScalingJob.sh | debug_hsw | FAILED | 15:0 | 2 | 128
35415285.batch | batch | missing | FAILED | 15:0 | 1 | 64

We can pull the job IDs from the log files...

slurmLogFiles
jobIdFromSlurmLogName (generic function with 1 method)
slurmIds

24 rows × 25 columns (omitted printing of 17 columns)

JobID | JobName | QOS | State | ExitCode | NNodes | NCPUS | Layout
String | String | String? | String | String | Int64 | Int64 | String?
35415305 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 2 | 128 | missing
35415305.0 | julia | missing | COMPLETED | 0:0 | 2 | 64 | Block
35415351 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 3 | 192 | missing
35415351.0 | julia | missing | COMPLETED | 0:0 | 3 | 96 | Block
35415425 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 4 | 256 | missing
35415425.0 | julia | missing | COMPLETED | 0:0 | 4 | 128 | Block
35415476 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 5 | 320 | missing
35415476.0 | julia | missing | COMPLETED | 0:0 | 5 | 160 | Block
35415515 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 6 | 384 | missing
35415515.0 | julia | missing | COMPLETED | 0:0 | 6 | 192 | Block
35415541 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 7 | 448 | missing
35415541.0 | julia | missing | COMPLETED | 0:0 | 7 | 224 | Block
35415576 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 8 | 512 | missing
35415576.0 | julia | missing | COMPLETED | 0:0 | 8 | 256 | Block
35415699 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 9 | 576 | missing
35415699.0 | julia | missing | COMPLETED | 0:0 | 9 | 288 | Block
35415777 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 10 | 640 | missing
35415777.0 | julia | missing | COMPLETED | 0:0 | 10 | 320 | Block

12 rows × 25 columns (omitted printing of 15 columns)

JobID | JobName | QOS | State | ExitCode | NNodes | NCPUS | Layout | CPUTimeRAW | ElapsedRaw
String | String | String? | String | String | Int64 | Int64 | String? | Int64 | Int64
35415305.0 | julia | missing | COMPLETED | 0:0 | 2 | 64 | Block | 5120 | 80
35415351.0 | julia | missing | COMPLETED | 0:0 | 3 | 96 | Block | 6048 | 63
35415425.0 | julia | missing | COMPLETED | 0:0 | 4 | 128 | Block | 8064 | 63
35415476.0 | julia | missing | COMPLETED | 0:0 | 5 | 160 | Block | 8800 | 55
35415515.0 | julia | missing | COMPLETED | 0:0 | 6 | 192 | Block | 11712 | 61
35415541.0 | julia | missing | COMPLETED | 0:0 | 7 | 224 | Block | 11872 | 53
35415576.0 | julia | missing | COMPLETED | 0:0 | 8 | 256 | Block | 13568 | 53
35415699.0 | julia | missing | COMPLETED | 0:0 | 9 | 288 | Block | 13824 | 48
35415777.0 | julia | missing | COMPLETED | 0:0 | 10 | 320 | Block | 16000 | 50
35453280.0 | julia | missing | COMPLETED | 0:0 | 12 | 384 | Block | 29568 | 77
35506920.0 | julia | missing | COMPLETED | 0:0 | 15 | 480 | Block | 29760 | 62
35506934.0 | julia | missing | COMPLETED | 0:0 | 20 | 640 | Block | 44160 | 69


So the elapsed time is exactly the CPU seconds per task
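
This can be checked directly from the table: for the julia step, CPUTimeRAW is exactly NCPUS × ElapsedRaw.

```julia
# CPU-seconds per rank for a job step; for these runs it equals the
# step's wall time exactly (e.g. first row: 5120 / 64 = 80 s).
cputime_per_rank(cputimeraw, ncpus) = cputimeraw / ncpus

cputime_per_rank(5120, 64)   # → 80.0
cputime_per_rank(6048, 96)   # → 63.0
```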


12 rows × 2 columns

ElapsedRaw | CPUTimePerRank
Int64 | Float64
80 | 80.0
63 | 63.0
63 | 63.0
55 | 55.0
61 | 61.0
53 | 53.0
53 | 53.0
48 | 48.0
50 | 50.0
77 | 77.0
62 | 62.0
69 | 69.0

Here's the total elapsed time. This looks goofy.

totalBatchTimePlot

12 rows × 26 columns (omitted printing of 18 columns)

JobID | JobName | QOS | State | ExitCode | NNodes | NCPUS | Layout
String | String | String? | String | String | Int64 | Int64 | String?
35415305 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 2 | 128 | missing
35415351 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 3 | 192 | missing
35415425 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 4 | 256 | missing
35415476 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 5 | 320 | missing
35415515 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 6 | 384 | missing
35415541 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 7 | 448 | missing
35415576 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 8 | 512 | missing
35415699 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 9 | 576 | missing
35415777 | run_strongScalingJob.sh | debug_hsw | COMPLETED | 0:0 | 10 | 640 | missing
35453280 | run_strongScalingJob.sh | regular_1 | COMPLETED | 0:0 | 12 | 768 | missing
35506920 | run_strongScalingJob.sh | regular_1 | COMPLETED | 0:0 | 15 | 960 | missing
35506934 | run_strongScalingJob.sh | regular_1 | COMPLETED | 0:0 | 20 | 1280 | missing
batchJobCostPlot

Look at how long Julia itself took

juliaElapsedTimePlot

So, Julia is taking significantly more time than what I recorded in MPI


Note that for 4 nodes, the Julia time point plots on top of the batch time point. So there's a lot going on that's not accounted for in my MPI timing.


Here's a function to read in strings like "100K", "150.6M", and "2.5G" and output a Float64 in megabytes.
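
A hypothetical re-implementation of what such a function might look like (the notebook's actual code is not shown in this export; binary, 1024-based unit conversion is assumed here, which may differ from the real convention):

```julia
# Parse strings like "100K", "150.6M", "2.5G" into megabytes (Float64).
# Assumes binary (1024-based) units -- an assumption, not the notebook's code.
function parseMemoryToMB(s::AbstractString)
    factors = Dict('K' => 1 / 1024, 'M' => 1.0, 'G' => 1024.0)
    unit = uppercase(s[end])
    haskey(factors, unit) || error("unrecognized memory unit in $(repr(s))")
    parse(Float64, s[1:end-1]) * factors[unit]
end
```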

parseMemoryToMB (generic function with 1 method)

Write some tests...

Test Summary:        | Pass  Total
Test parseMemoryToMB |    4      4
maxVmsizePlot
maxRSSPlot
maxDiskReadPlot
maxDiskWrite